Declarative checkpoint config conversion (Llama pilot) #508
Open
jlamypoirier wants to merge 11 commits into main from
Conversation
Eight config fields whose values directly affect model architecture were previously hinted feature, core, or left unhinted; they are now tagged FieldHint.architecture. They drive the upcoming declarative-converter coverage check, which uses FieldHint.architecture as the source of truth for "must be handled by every checkpoint format":

- AttentionConfig.dense_layer (output projection presence)
- AttentionConfig.softmax_scale_power (attention scaling)
- MLPConfig.activation (forward-pass activation type)
- MoEMLPConfig.router (routing weights drive token assignment)
- Llama3RotaryConfig: scale_factor, low_frequency_factor, high_frequency_factor, original_context_length
- YarnRotaryConfig: scale_factor, attention_factor, beta_fast, beta_slow, original_context_length
- StochasticMixerConfig.main_mixer_name (selects the inference mixer)
- PatchEmbeddingsConfig.patch_height / patch_width (input tokenization)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
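A standalone sketch of how an architecture hint can drive that enumeration (not Fast-LLM's actual `Field`/`FieldHint` API; the dataclass metadata layout and the defaults below are assumptions):

```python
# Minimal standalone sketch: an `architecture` hint lets a coverage check
# enumerate the fields every checkpoint format must explicitly handle.
import dataclasses
import enum


class FieldHint(enum.Enum):
    feature = "feature"
    core = "core"
    architecture = "architecture"  # value changes the model's architecture


@dataclasses.dataclass
class AttentionConfig:
    # Hypothetical metadata layout and defaults; Fast-LLM's real Field API differs.
    softmax_scale_power: float = dataclasses.field(
        default=0.5, metadata={"hint": FieldHint.architecture}
    )
    dense_layer: bool = dataclasses.field(
        default=True, metadata={"hint": FieldHint.architecture}
    )
    dropout: float = dataclasses.field(default=0.0, metadata={"hint": FieldHint.feature})


def architecture_fields(config_cls: type) -> set[str]:
    """Fields a checkpoint converter must account for."""
    return {
        f.name
        for f in dataclasses.fields(config_cls)
        if f.metadata.get("hint") is FieldHint.architecture
    }


assert architecture_fields(AttentionConfig) == {"softmax_scale_power", "dense_layer"}
```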
Reintroduces the declarative config-conversion shape that pre-dated PR #362, applied within the post-#362 modular per-section structure. Replaces the imperative import_config/export_config bodies with a small set of named primitives and a recursive walker driven by per-section declarations.

Primitives in fast_llm.engine.checkpoint.external:
- RenameConfigConverter — 1:1 path rename
- ConstantExportConfigConverter — write constant on export, assert on import
- ConstantImportConfigConverter — assert on export, inject on import
- DefaultConfigConverter — rename with HF-side fallback
- OptionalConfigConverter — emit/import only when non-sentinel
- IgnoredConfigConverter — declare a field as intentionally not converted
- CustomConfigConverter — escape hatch for cross-field transforms
- NestedConfigConverter — recurse into a fixed-typed sub-config; flat-merges HF output into the parent (transformer side is assumed flat)
- DispatchConfigConverter — runtime type dispatch for polymorphic sub-configs

ConfigSectionConverter is the per-Fast-LLM-class converter base. Subclasses declare their conversion via _create_config_converters() and inherit import_config/export_config concretely.

The architecture-coverage check fires only when type(config) exactly matches the converter's declared fast_llm_config_class — strict subclass types defer to a more specific converter, allowing yet-to-be-migrated subclasses (e.g., Mixtral on Llama) to call super().export_config() without tripping the parent's check on fields the parent doesn't know about.

The walker is implicit: NestedConfigConverter / DispatchConfigConverter call the public import_config/export_config on the sub-converter class so subclass overrides participate, rather than a private path that bypasses them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
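A minimal standalone sketch of that shape — the class names mirror the PR's primitives, but the signatures and the dict-based configs are assumptions, not the actual Fast-LLM API:

```python
# Standalone sketch; configs are plain dicts here, not Fast-LLM Config objects.
import abc
import typing


def _get(config: dict, path: tuple[str, ...]):
    for key in path:
        config = config[key]
    return config


def _set(config: dict, path: tuple[str, ...], value) -> None:
    for key in path[:-1]:
        config = config.setdefault(key, {})
    config[path[-1]] = value


class RenameConfigConverter:
    """1:1 path rename between a Fast-LLM field and an HF config key."""

    def __init__(self, fast_llm_path: tuple[str, ...], hf_path: tuple[str, ...]):
        self.fast_llm_path = fast_llm_path
        self.hf_path = hf_path

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        _set(hf_config, self.hf_path, _get(fast_llm_config, self.fast_llm_path))

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        _set(fast_llm_config, self.fast_llm_path, _get(hf_config, self.hf_path))


class ConfigSectionConverter(abc.ABC):
    """Per-Fast-LLM-class converter: subclasses only declare their converters."""

    fast_llm_config_class: typing.ClassVar[type]

    @classmethod
    @abc.abstractmethod
    def _create_config_converters(cls) -> list:
        ...

    @classmethod
    def export_config(cls, fast_llm_config: dict) -> dict:
        hf_config: dict = {}
        for converter in cls._create_config_converters():
            converter.export_to(fast_llm_config, hf_config)
        return hf_config

    @classmethod
    def import_config(cls, hf_config: dict) -> dict:
        fast_llm_config: dict = {}
        for converter in cls._create_config_converters():
            converter.import_to(hf_config, fast_llm_config)
        return fast_llm_config
```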
Pilot of the new ConfigSectionConverter framework. Each Llama section converter (Normalization/MLP/Attention/Block/Embeddings/Head/BaseModel) now declares its conversion via _create_config_converters() instead of imperative import_config/export_config bodies. Weight side is unchanged.

Notable shape decisions:
- LlamaDecoderConverter stays as a regular (imperative) class because Fixed/Pattern block-sequence dispatch doesn't lend itself to the declarative shape. LlamaBaseModelConverter wires it in via a small CustomConfigConverter; subclasses (Mistral, Qwen2, MTP-Llama, ...) continue to plug in different block converters via block_converter_class.
- _check_config is retained as an overridable classmethod and called from the linear_layers CustomConfigConverter, so Qwen2 can keep its asymmetric Q/K/V bias rule without re-implementing the export.
- IgnoredConfigConverter is used for ParameterConfig sub-fields with no architecture-significant content (weight, output_weight, word_embeddings), and for prediction_heads (which Llama HF doesn't expose; subclass MTP-Llama adds it imperatively).
- peft uses CustomConfigConverter to assert NoPeftConfig on export. Llama HF format cannot represent PEFT, so a configured LoRA now fails loudly rather than being silently dropped.
- Rotary remains in CustomConfigConverter — the v4/v5 transformers split (rope_theta/rope_scaling vs. rope_parameters) and three rope_type variants don't fit pure rename primitives.

Verified with live round-trips of Llama-3, Qwen2, Mistral, Mixtral, and MTP-Llama HF configs, plus tests/models/test_checkpoint.py for all GPT formats (139 passed, 0 failed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
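Building on the standalone sketch under the previous commit note, a hypothetical MLP section converter reduces to a declaration list; the Fast-LLM and HF field paths are illustrative, not the PR's exact ones:

```python
# Reuses RenameConfigConverter / ConfigSectionConverter from the sketch above.
class LlamaMLPConverter(ConfigSectionConverter):
    fast_llm_config_class = dict  # stands in for MLPConfig in this sketch

    @classmethod
    def _create_config_converters(cls) -> list:
        return [
            RenameConfigConverter(("intermediate_size",), ("intermediate_size",)),
            RenameConfigConverter(("activation",), ("hidden_act",)),
        ]


hf = LlamaMLPConverter.export_config({"intermediate_size": 11008, "activation": "silu"})
assert hf == {"intermediate_size": 11008, "hidden_act": "silu"}
assert LlamaMLPConverter.import_config(hf) == {"intermediate_size": 11008, "activation": "silu"}
```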
Adds `_validate_export(cls, config)` classmethod hook on `ConfigSectionConverter`, called automatically from `export_config` after the architecture-coverage check. Replaces five `CustomConfigConverter`-as-validator blocks (`linear_layers`/`layers` in attention and MLP, `position_embeddings` in embeddings, `peft` in base model, plus the `_check_config` chain on attention) with `IgnoredConfigConverter` for field-claiming + small `_validate_export` overrides. Mistral and Qwen2 rename their `_check_config` overrides accordingly; Pixtral's imperative export updates its `cls._check_config(config)` call site.

Also addresses several reviewer-flagged correctness/cleanup items:
- Drop the half-removed `parent_context` parameter from every primitive's `import_to` signature (and from `CustomConfigConverter`'s `import_fn`). It was unreachable through the walker.
- `_check_architecture_coverage` now reads `cls.fast_llm_config_class` directly instead of `getattr(..., None)`, surfacing missing class-attribute declarations as `AttributeError` rather than silently disabling the safety net.
- Drop the unused `hf_paths` parameter from `CustomConfigConverter.__init__`. There is no symmetric HF-side coverage check yet, so the field was cosmetic.
- Add a TODO note in `_check_architecture_coverage` documenting that the `MoEMLPConfig`/`MambaConfig`/etc. safety net is gated on later migrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
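A standalone sketch of the hook's shape, with the coverage check and the declared converters stubbed out; the Qwen2 field names below are illustrative, not the real config fields:

```python
# Standalone sketch of the _validate_export hook (signatures assumed from this
# PR's description): export runs the coverage check, then the per-class validator.
class ConfigSectionConverter:
    @classmethod
    def export_config(cls, config: dict) -> dict:
        cls._check_architecture_coverage(config)
        cls._validate_export(config)
        return cls._export(config)

    @classmethod
    def _check_architecture_coverage(cls, config: dict) -> None:
        pass  # stub; the real check walks FieldHint.architecture fields

    @classmethod
    def _validate_export(cls, config: dict) -> None:
        pass  # default: nothing extra to check

    @classmethod
    def _export(cls, config: dict) -> dict:
        return dict(config)  # stub for the declared converters


class Qwen2AttentionConverter(ConfigSectionConverter):
    @classmethod
    def _validate_export(cls, config: dict) -> None:
        # Illustrative stand-in for Qwen2's asymmetric bias rule:
        # Q/K/V biases enabled, dense (output projection) bias disabled.
        if not (config["qkv_bias"] and not config["dense_bias"]):
            raise ValueError("Qwen2 HF format requires Q/K/V bias and no dense bias.")


Qwen2AttentionConverter.export_config({"qkv_bias": True, "dense_bias": False})
```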
The dict of named per-block configs is unambiguously architecture metadata; without an explicit hint it defaulted to `unknown`, hiding it from the architecture-coverage check used by declarative checkpoint converters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two additions, both required by Apriel2's nested HF schema:
- `NestedConfigConverter` gains an optional `hf_path` kwarg. When set, the sub-converter's output is placed under that nested key instead of being flat-merged. Existing flat-merge behavior is unchanged when `hf_path` is omitted.
- New `TypedDictContainerConfigConverter` for `dict[str, Config]` fields where each entry is round-tripped through a per-class section converter. Polymorphic dispatch via the entry's runtime type on export and the HF discriminator on import. A homogeneous mode (single registered class with `hf_type_name = None`) skips the discriminator entirely.

Both `DispatchConfigConverter` and `TypedDictContainerConfigConverter` now also inject the Fast-LLM `dynamic_type_name` discriminator into the imported sub-dict so the parent's `from_dict` dispatches to the right `Config` subclass without a separate ConstantImport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
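A standalone, dict-based sketch of the container's semantics — the real converter operates on typed `Config` objects and per-class section converters, so the names and shapes here are assumptions:

```python
# Standalone sketch: each dict entry round-trips through a per-class converter,
# dispatched by a discriminator on both sides. The dispatch keys on the dynamic
# type name instead of the Python class purely to keep the sketch dict-based.
class AttentionMixerConverter:
    hf_type_name = "attention"

    @staticmethod
    def export_config(cfg: dict) -> dict:
        return {"heads": cfg["heads"]}

    @staticmethod
    def import_config(hf: dict) -> dict:
        return {"type": "attention", "heads": hf["heads"]}  # inject discriminator


class MambaMixerConverter:
    hf_type_name = "mamba"

    @staticmethod
    def export_config(cfg: dict) -> dict:
        return {"state_size": cfg["state_size"]}

    @staticmethod
    def import_config(hf: dict) -> dict:
        return {"type": "mamba", "state_size": hf["state_size"]}


class TypedDictContainerConfigConverter:
    def __init__(self, registry: dict[str, type]):
        self._registry = registry  # keyed by the Fast-LLM dynamic type name

    def export(self, entries: dict[str, dict]) -> dict:
        out = {}
        for name, cfg in entries.items():
            conv = self._registry[cfg["type"]]
            out[name] = {"type": conv.hf_type_name, **conv.export_config(cfg)}
        return out

    def import_(self, hf_entries: dict[str, dict]) -> dict:
        by_hf_name = {c.hf_type_name: c for c in self._registry.values()}
        return {name: by_hf_name[hf["type"]].import_config(hf) for name, hf in hf_entries.items()}


mixers = TypedDictContainerConfigConverter(
    {"attention": AttentionMixerConverter, "mamba": MambaMixerConverter}
)
hf = mixers.export({"attn": {"type": "attention", "heads": 32},
                    "ssm": {"type": "mamba", "state_size": 16}})
assert mixers.import_(hf)["ssm"]["type"] == "mamba"
```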
Stress-tests the framework's polymorphic dispatch and typed-dict
support: Apriel2's HF schema is nested (`decoder.block.mixer.{...}`,
`head.normalization`, `mixers.{name}`) and the mixer field is
heterogeneously polymorphic (Attention/Mamba/StochasticMixer/GDN/KDA).
Migrated converters: per-mixer (Attention/Mamba/GDN/KDA), the
StochasticMixer container (driven by TypedDictContainer over a
leaf-mixer registry), per-normalization (RMS/LayerNorm/NoNorm), MLP,
Block, Fixed/Pattern decoder variants (selected by Dispatch on
runtime BlockSequenceConfig type), Head, and BaseModel.
The imperative weight-side `get_converters` methods are preserved
unchanged so the multimodal Apriel2 converter (which inherits from
the text-only one) keeps working without modification.
PatternDecoder's `blocks` dict uses the homogeneous mode of
TypedDictContainer (single-class registry, no discriminator). The
attention rotary-type translation (default ↔ mistral_1d) and Mamba's
auxiliary HF fields (d_conv, conv_bias, dt_proj_bias derived from
linear-config bias flags) remain on `CustomConfigConverter` since
they're shape-changing transforms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
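A standalone sketch of the runtime-type dispatch used for the Fixed/Pattern decoder split; the HF keys are hypothetical and the registry holds plain callables rather than section-converter classes:

```python
# Standalone sketch of DispatchConfigConverter-style runtime-type dispatch.
import dataclasses
from collections.abc import Callable


@dataclasses.dataclass
class FixedBlockSequenceConfig:
    num_blocks: int


@dataclasses.dataclass
class PatternBlockSequenceConfig:
    pattern: list[str]


class DispatchConfigConverter:
    """Pick a sub-converter based on the exact runtime type of the sub-config."""

    def __init__(self, registry: dict[type, Callable[[object], dict]]):
        self._registry = registry

    def export(self, config: object) -> dict:
        return self._registry[type(config)](config)


# Hypothetical HF keys, purely for illustration.
decoder_dispatch = DispatchConfigConverter({
    FixedBlockSequenceConfig: lambda c: {"num_hidden_layers": c.num_blocks},
    PatternBlockSequenceConfig: lambda c: {"block_pattern": c.pattern},
})
assert decoder_dispatch.export(PatternBlockSequenceConfig(["attn", "mamba"])) == {
    "block_pattern": ["attn", "mamba"]
}
```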
…primitives
Each format inherits Llama's `_create_config_converters` and replaces only the
fields that diverge:
* Mistral: ConstantImportConfigConverter pinning `add_linear_biases=False` for
attention and MLP (HF format has no `attention_bias`/`mlp_bias`); rename
`window_size` <-> `sliding_window`.
* Qwen2: ConstantImportConfigConverter for `add_linear_biases`; CustomConfigConverter
for `head_size` (no `head_dim` HF field, derive on import); CustomConfigConverter
for per-layer biases (always Q/K/V=True, dense=False); the head_dim relationship
`heads * head_size == hidden_size` moves to `_validate_export` on the base-model
converter; the use_mrope guard moves to `import_config`.
* MTP-Llama: RenameConfigConverter for `prediction_heads` (Llama blanket-ignores it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
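A standalone sketch of the inherit-and-replace shape; the converters are collapsed to labelled tuples keyed by Fast-LLM field purely for illustration, and the PR's actual `_create_config_converters` return structure may differ:

```python
# Standalone sketch: a format subclass replaces only the diverging declarations.
class LlamaAttentionConverter:
    @classmethod
    def _create_config_converters(cls) -> dict[str, tuple]:
        return {
            "head_size": ("rename", "head_dim"),
            "window_size": ("ignored", None),
            "add_linear_biases": ("rename", "attention_bias"),
        }


class MistralAttentionConverter(LlamaAttentionConverter):
    @classmethod
    def _create_config_converters(cls) -> dict[str, tuple]:
        converters = super()._create_config_converters()
        # Mistral HF configs carry no `attention_bias`: pin False on import instead.
        converters["add_linear_biases"] = ("constant_import", False)
        # Sliding windows are representable, so `window_size` becomes a plain rename.
        converters["window_size"] = ("rename", "sliding_window")
        return converters


assert MistralAttentionConverter._create_config_converters()["window_size"] == (
    "rename",
    "sliding_window",
)
```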
`MixtralMLPConverter` switches its `fast_llm_config_class` to `MoEMLPConfig` so the
architecture-coverage check sees MoE-specific fields. The config-side overrides:
* `add_linear_biases` -> ConstantImportConfigConverter (Mixtral has no `mlp_bias`).
* `experts` <-> `num_local_experts` and `experts_per_token` <-> `num_experts_per_tok`
via RenameConfigConverter.
* `shared_experts=0` and `routing=topk` pinned via ConstantImportConfigConverter so
they round-trip cleanly without an HF representation.
* `router` covered by IgnoredConfigConverter (Mixtral's gate is a default `LinearConfig`).
The Fast-LLM dynamic-type discriminator (`type: "moe"`) is injected via an `import_config`
override since the MLP is wrapped via `NestedConfigConverter` rather than `DispatchConfigConverter`.
Diffusion-Dream and Diffusion-Llama need no migration: they only override `architecture`,
`get_transformers_configuration_class`, and `_export_config` (auto_map). They inherit the
declarative converters from their parents (Qwen2 and Llama).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
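A standalone sketch of the `ConstantImportConfigConverter` semantics relied on above (assert on export, inject on import, no HF key written); the constructor signature is assumed:

```python
# Standalone sketch of the "pin a value that has no HF representation" primitive.
class ConstantImportConfigConverter:
    def __init__(self, fast_llm_field: str, value):
        self.fast_llm_field = fast_llm_field
        self.value = value

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        # Nothing is written to the HF config; the value must match to be exportable.
        if fast_llm_config[self.fast_llm_field] != self.value:
            raise ValueError(
                f"{self.fast_llm_field}={fast_llm_config[self.fast_llm_field]!r} "
                f"cannot be represented in this HF format (expected {self.value!r})."
            )

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_field] = self.value


imported: dict = {}
ConstantImportConfigConverter("shared_experts", 0).import_to({}, imported)
ConstantImportConfigConverter("routing", "topk").import_to({}, imported)
assert imported == {"shared_experts": 0, "routing": "topk"}
```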
…itives

`AprielMambaConverter`, `GatedDeltaNetConverter`, and `KimiDeltaAttentionConverter` become `ConfigSectionConverter` subclasses with their HF-side fields nested under the appropriate HF subkey (`ssm_cfg` for Mamba, `linear_attn_config` for GDN/KDA).

Mamba's three sibling-default fields (`d_inner`, `d_xb`, `dt_rank`) read the HF root's `hidden_size` directly via `DefaultConfigConverter.hf_default_fn` / `CustomConfigConverter`, removing the need for explicit `parent_context` plumbing through the framework. The per-layer convolution and dt biases use `CustomConfigConverter` to pick up the mixer-wide `add_linear_biases` fallback when unset; the existing `_check_config` per-layer assertions move to `_validate_export`.

`AprielBlockConverter` (the per-block dispatcher) and `AprielDecoderConverter` (the `hybrid_block_layout` driver) stay imperative because Apriel's HF format encodes the mixer type in a parent-level list rather than a per-block discriminator, which `DispatchConfigConverter` doesn't model. The `type: "mamba"`/`"gdn"`/`"kda"` Fast-LLM discriminator is injected via a one-line `import_config` override on each leaf converter (same pattern Mixtral uses).

The HF format has no test coverage in `tests/models/test_checkpoint.py` or `tests/models/test_hf_roundtrip.py`, so verification was a synthesized live round-trip covering each mixer leaf plus a hybrid attention+Mamba pattern decoder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
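A standalone sketch of the `hf_default_fn` idea (signature assumed; the 2× `hidden_size` fallback is purely illustrative, not Mamba's actual default):

```python
# Standalone sketch: when the HF key is absent, derive the value from the HF root
# config instead of threading a parent_context through the walker.
class DefaultConfigConverter:
    def __init__(self, fast_llm_field: str, hf_path: tuple[str, ...], hf_default_fn):
        self.fast_llm_field = fast_llm_field
        self.hf_path = hf_path
        self.hf_default_fn = hf_default_fn

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        node = hf_config
        for key in self.hf_path[:-1]:
            node = node.get(key, {})
        value = node.get(self.hf_path[-1])
        if value is None:
            value = self.hf_default_fn(hf_config)  # fallback reads the HF root
        fast_llm_config[self.fast_llm_field] = value


d_inner = DefaultConfigConverter(
    "d_inner", ("ssm_cfg", "d_inner"), lambda hf: 2 * hf["hidden_size"]
)
imported: dict = {}
d_inner.import_to({"hidden_size": 4096, "ssm_cfg": {}}, imported)
assert imported == {"d_inner": 8192}
```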
…larative primitives

`PixtralNormalizationConverter` collapses to a single `_create_config_converters` override that pins `epsilon=1e-5` via `ConstantImportConfigConverter` (asserts on export, injects on import; no HF write).

`PixtralEmbeddingsConverter` becomes a `ConfigSectionConverter` with declarations for `patch_height` (rename to `patch_size`), `patch_width` (mirror `patch_size` on import), `num_channels` (export-only constant 3), nested `normalization`, and an `IgnoredConfigConverter` for `patch_embeddings`. The `patch_height == patch_width` and `patch_embeddings.bias.enabled in (None, False)` checks move to `_validate_export`.

The remaining Llava and Apriel2 multimodal converters stay imperative: they're cross-section aggregators (vision_config + text_config + top-level merge) whose shape doesn't fit a single ConfigSectionConverter, often with parent-context dependencies (e.g., the adapter's intermediate_size derives from the text model's hidden_size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
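A standalone sketch of the `ConstantExportConfigConverter` semantics used for `num_channels` (constructor signature assumed):

```python
# Standalone sketch: write a constant HF key on export, assert it on import.
class ConstantExportConfigConverter:
    def __init__(self, hf_key: str, value):
        self.hf_key = hf_key
        self.value = value

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        # Nothing is read from the Fast-LLM side; the HF key is always emitted.
        hf_config[self.hf_key] = self.value

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        # On import the key must either be absent or match the pinned constant.
        if hf_config.get(self.hf_key, self.value) != self.value:
            raise ValueError(f"Unsupported {self.hf_key}={hf_config[self.hf_key]!r}")


hf_config: dict = {}
ConstantExportConfigConverter("num_channels", 3).export_to({}, hf_config)
assert hf_config == {"num_channels": 3}
```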
Summary
First step of the conversion-simplification refactor. Reintroduces declarative config-conversion primitives, applied within the post-#362 modular per-section structure, and migrates Llama as the pilot to validate the design.
Three sequential commits:
1. `FieldHint.architecture` — eight fields (attention `dense_layer` / `softmax_scale_power`, MLP `activation`, MoE `router`, four Llama3 / five Yarn rotary scaling fields, StochasticMixer `main_mixer_name`, vision patch height/width). These drive the new coverage check.
2. `ConfigConverter` primitives and section-converter ABC in `fast_llm/engine/checkpoint/external.py`. Nine primitives (Rename, ConstantExport, ConstantImport, Default, Optional, Ignored, Custom, Nested, Dispatch) plus `ConfigSectionConverter`. The walker is implicit — `NestedConfigConverter` and `DispatchConfigConverter` call the public `import_config`/`export_config` so subclass overrides participate. The coverage check fires only when `type(config)` exactly matches the converter's declared `fast_llm_config_class`, so unmigrated subclasses (Mixtral on Llama, Qwen2's `_check_config` override, etc.) keep working through `super()`.
3. Llama pilot migration. `LlamaDecoderConverter` stays imperative (Fixed/Pattern block-sequence dispatch doesn't fit cleanly). `_check_config` is retained as an overridable hook. PEFT non-default values now fail loudly on export instead of being silently dropped.

Notable shape decisions (open to course-correction)
- The coverage check is exact-type (`type(config) is cls.fast_llm_config_class`); strict subclasses defer to a more specific converter (see the sketch after this list). This was needed to keep Mixtral working through `super().export_config()` on `MoEMLPConfig` while only Llama is migrated.
- `NestedConfigConverter` is flat-merge only. The transformer side is assumed flat. Non-flat HF cases (Apriel2 mixers) will use `DispatchConfigConverter` with an `hf_path`, or `CustomConfigConverter`.
- Sub-configs are declared as `NestedConfigConverter(field, converter_class)` for fixed types and `DispatchConfigConverter(field, registry)` for polymorphic ones. Subclasses override sub-converter classes the same way as today's `ClassVar[type]` pattern.
- `parent_context` plumbing is dropped for now (it was speculative and unused in Llama). It will be re-introduced as an explicit kwarg when the Apriel migration needs it for mamba sibling-field defaults.
- `IgnoredConfigConverter` is permissive — it silently passes architecture fields through without a check. It is used for ParameterConfig sub-fields (init/lr_scale only, no architecture sub-fields) and for fields where the Llama HF format genuinely has no representation. PEFT (which IS architecture-significant when configured) uses `CustomConfigConverter` with an explicit `Assert.custom(isinstance, config.peft, NoPeftConfig)` instead.
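A minimal standalone sketch of the exact-type gate from the first bullet (the coverage check itself is stubbed; only the dispatch rule is shown):

```python
# Standalone sketch of the exact-type coverage gate.
class ConfigSectionConverter:
    fast_llm_config_class: type = object

    @classmethod
    def export_config(cls, config) -> dict:
        if type(config) is cls.fast_llm_config_class:
            # Exact match: this converter must account for every architecture field.
            cls._check_architecture_coverage(config)
        # Strict subclasses fall through so super().export_config() from an
        # unmigrated subclass doesn't trip the parent's check.
        return {}

    @classmethod
    def _check_architecture_coverage(cls, config) -> None:
        pass  # stub; the real check walks FieldHint.architecture fields


class MLPConfig:
    pass


class MoEMLPConfig(MLPConfig):
    pass


class LlamaMLPConverter(ConfigSectionConverter):
    fast_llm_config_class = MLPConfig


LlamaMLPConverter.export_config(MLPConfig())     # coverage check fires
LlamaMLPConverter.export_config(MoEMLPConfig())  # deferred to a MoE-specific converter
```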
Verification
- Live round-trips of Llama-3, Qwen2, Mistral, Mixtral, and MTP-Llama HF configs (including Qwen2's derived `head_size`).
- Export fails loudly on unsupported `softmax_scale_power` and on configured PEFT.
- `pytest tests/models/test_checkpoint.py --models gpt`: 139 passed, 0 failed across llama / qwen_2 / mistral / mixtral / mtp_llama / apriel2_attn / llava / diffusion_llama.

Test plan
- `pytest -v -n 6 tests/models/test_checkpoint.py 2>&1 | tee /tmp/fast_llm_tests/pytest_out.txt`
- `pytest -v -n 6 tests/models/test_hf_roundtrip.py`
- `pytest -v -n 6 --models gpt tests/`
- `pytest -v -n 6 fast_llm_external_models/tests/` (separate invocation per CLAUDE.md)
- `fast-llm convert --input.format llama --input.path <ref> --output.format llama --output.path <tmp>`; reload both and compare configs.

What's not in this PR
Phase 2 steps 3–8 of the plan (apriel2 / mistral / qwen2 / mtp_llama / mixtral / diffusion / apriel / multimodal migrations + cleanup) and the weight-converter declarative refactor are deferred. The framework is built so they can land incrementally on top of this.
🤖 Generated with Claude Code